Deep Encrypted Traffic Detection: An Anomaly Detection Framework for Encryption Traffic Based on Parallel Automatic Feature Extraction

With an increasing number of network attacks using encrypted communication, the anomaly detection of encryption traffic is of great importance to ensure reliable network operation. However, the existing feature extraction methods for encrypted traffic anomaly detection have difficulties in extracting features, resulting in their low efficiency. In this paper, we propose a framework of encrypted traffic anomaly detection based on parallel automatic feature extraction, called deep encrypted traffic detection (DETD). The proposed DETD uses a parallel small-scale multilayer stack autoencoder to extract local traffic features from encrypted traffic and then adopts an L1 regularization-based feature selection algorithm to select the most representative feature set for the final encrypted traffic anomaly detection task. The experimental results show that DETD has promising robustness in feature extraction, i.e., the feature extraction efficiency of DETD is 66% higher than that of the conventional stacked autoencoder, and the anomaly detection performance is as high as 99.998%, and thus DETD outperforms the deep full-range framework and other neural network anomaly detection algorithms.


Introduction
With the increasing scale of Internet users, the Internet has begun to carry an increasing number of emerging network applications, and accurate traffic classification is the premise of the basic tasks of the network. Especially with the wide application of encrypted data transmission, network traffic encryption is becoming a standard [1][2][3][4]. Encryption makes abnormal behaviors in the network, such as botnets [5], worms [6], image transmission [7,8], and denial-of-service attacks [9], more covert. Therefore, detecting malicious encrypted traffic without decryption is the present difficulty of traffic monitoring and poses new challenges to traffic anomaly detection.
As a pioneering work, Anderson and McGrew proposed extending existing traffic anomaly detection methods without decrypting the network traffic [10][11][12]. In their work, a feature set with prominent discrimination is selected from the unencrypted transport layer security (TLS) handshake information, the DNS response information related to the destination IP address in the TLS flow, and the header information of HTTP flows within a 5-minute window from the same source IP address; network traffic with malicious behavior is then identified from the encrypted network traffic by machine learning. Inspired by the efficient feature extraction capabilities of deep learning [13][14][15], Wei et al. [16] used a one-dimensional convolutional neural network (1D-CNN) to better fit encrypted traffic data, building on the work of Anderson and other predecessors. In 2018, Yang et al. [17] proposed two deep learning methods to classify encrypted traffic. One extracts encrypted traffic features with an autoencoder, and the other uses a convolutional neural network to learn high-dimensional features of encrypted traffic. Both methods can extract features from flow metadata, packet sizes, packet arrival times, and unencrypted TLS header information. Moreover, their experimental results verified that the convolutional neural network is superior to the autoencoder, as well as to other competing algorithms, for feature extraction. Zeng et al. proposed a deep full-range (DFR) anomaly detection framework [18]. Generally speaking, in such anomaly detection algorithms, a neural network (such as an LSTM or SAE) is first used to extract features. After feature extraction, the L1 regularization method is used to screen the features and reduce the computation of anomaly detection.
However, traditional machine learning methods that use statistics for traffic anomaly detection have great disadvantages in feature selection. First, the quality of feature selection depends not only on strong expert knowledge but also, to a certain extent, on private information, which makes it a very resource-consuming task. Moreover, deep learning methods for traffic anomaly detection, such as Wei's method [16], focus only on the structure of the deep neural network model and extract many payload bytes from the original traffic packets. In addition, all the extracted payload bytes are global features of the traffic packets, which may lead to serious weaknesses such as feature dimension redundancy, a large amount of computation, poor detection performance, and insufficient feature extraction efficiency.
To overcome the shortcomings of the above methods, we propose an encrypted traffic anomaly detection framework based on parallel automatic feature extraction, called deep encrypted traffic detection (DETD). The proposed DETD extracts local features of encrypted traffic by using small-scale parallel stacked autoencoder layers and then uses feature filtering to retain the effective information that strongly characterizes encrypted traffic. Specifically, the parallel feature extraction module used in DETD effectively retains the characteristics of encrypted traffic, which can improve feature extraction efficiency and reduce the delay of encrypted traffic classification. The main contributions of this paper lie in the following three aspects: (1) We propose an encrypted traffic anomaly detection framework, including encrypted traffic packet pretreatment, parallel automatic feature extraction, feature selection, and anomaly detection classification. (2) We introduce a small-scale parallel automatic feature extraction algorithm that can effectively extract the local features of encrypted traffic and greatly improve feature extraction efficiency. (3) We design an L1 regularization-based feature selection algorithm to select the most representative feature set for the final encrypted traffic anomaly detection task.
The rest of this paper is organized as follows. Section 2 presents the related work. Section 3 introduces the proposed DETD framework. Sections 4 and 5 present the experimental evaluation and discussion, respectively. Section 6 concludes the paper.

Related Work
Although traditional traffic anomaly detection methods have made certain progress [19][20][21][22], most traditional anomaly detection algorithms are not suitable for encrypted traffic. For traffic anomaly detection, encrypted and unencrypted traffic differ greatly. First, traffic features change greatly after encryption, and most content-based anomaly detection methods, such as deep packet inspection algorithms, are difficult to apply to encrypted traffic [23]. Second, encryption protocols are often accompanied by traffic masquerading techniques (such as protocol obfuscation and protocol variation), which transform encrypted traffic characteristics into those of commonly used traffic and bring great difficulties to traffic anomaly detection [24]. Because encryption technology encrypts only the payload, anomaly detection methods based on data flow characteristics are less affected by encryption. According to the different ways in which data flow features are used for encrypted traffic anomaly detection, we divide these methods into the following two categories: (1) manual feature selection methods and (2) automatic feature extraction approaches.
The anomaly detection methods based on manual feature extraction use expert knowledge to extract feature sets that are helpful for anomaly detection, such as the duration of the flow, the number of bytes of the flow per unit time, the arrival times in the forward and backward directions, and the size and density distribution of the flow. Lakhina et al. used the distribution of packet features (IP address and port) to detect and identify large-scale anomalous traffic [25]. The experimental results showed that the clustering method can effectively divide normal traffic and anomalous traffic into different clusters and can be used to find new anomalous traffic. Soule et al. proposed a method based on a traffic matrix to identify anomalous traffic [26]. In their method, Kalman filtering was first used to identify normal traffic, and then the threshold, variance analysis, wavelet transform, and generalized likelihood ratio were used to identify anomalous traffic. The ROC curve showed that their method can achieve a better balance between false positives and false negatives.
The core of anomaly detection methods based on automatic feature extraction is to use the powerful fitting ability of deep neural networks to automatically extract, from encrypted traffic, feature sets suitable for the final anomaly classification. Compared with traditional machine learning methods, automatic feature extraction methods have a deeper level of learning ability. Therefore, these approaches have a wide range of application scenarios in industry and academia, such as machinery fault diagnosis [27][28][29], network flow detection [30], botnet detection [31], and collaborative intrusion detection [32]. As mentioned in these references, automatic feature extraction methods always adopt neural networks to extract features. For example, Odiathevar et al. [30] developed an online-offline framework for anomalous traffic detection in network flows. In this framework, the authors adopted the learned knowledge of the offline model as the bias for selecting the training data for the online model, so that it can be used with any deep learning method and any anomaly detection algorithm. Moreover, Kim et al. [31] proposed a botnet detection method that uses recurrent neural networks to capture periodicity in network data, which is the key to detecting various botnets exhibiting sequential patterns. The proposed method can also detect botnets in an online manner based on a new anomaly scoring function representing the maliciousness of network connections. Similar to this work, Wang et al. [32] introduced a collaborative intrusion detection framework based on confidence.
However, the aforementioned approaches have certain shortcomings. First, the manual feature selection method requires expensive expert knowledge and labor, and the quality of the selected features depends entirely on expert experience. In addition, the automatic feature extraction method is inefficient, may produce redundant features, and may suffer from poor detection performance and a large amount of computation.

The Proposed DETD Framework
In this section, we present our proposed DETD encrypted traffic anomaly detection framework in detail. As shown in Figure 1, the proposed DETD can be divided into four functional modules: the encrypted data packet preprocessing module (pretreatment module), the parallel SAE automatic feature extraction module (PSAE module), the feature selection module, and the anomaly detection module. The preprocessing module is mainly used to remove interference data and duplicate files from the data packets and to read the original traffic data packets.

Pretreatment Module.
The data packet pretreatment module mainly reads the contents of encrypted traffic data packets through packet analysis and then converts them into the data format required by subsequent modules. The main reasons for adopting pretreatment are as follows: (1) the original data contain information that may interfere with anomaly detection, such as port numbers or MAC addresses; and (2) the original traffic data from the network have different scales, which is not an ideal input format for a deep neural network model. The preprocessing module includes four steps: flow purification, flow parsing, data normalization, and data blocking, as shown in Figure 2.
(i) Flow Purification. This step ensures that our proposed method is free from interference data, duplicate files, and empty files in the traffic packets. The flow purification process screens out some of the TCP or UDP header fields, some of the data link layer fields, and Ethernet-related data, such as MAC addresses.
(ii) Flow Parsing. In this step, we use the third-party Python library Scapy to read data from the encrypted flow packets. Notably, data feature recovery may cause data feature loss. To retain the data features as much as possible, each byte in the flow data is set as a feature value $x_m^n$, where $x_m^n \in D^d$ ($d$ is the dimension of dataset $D$) and $x_m^n$ represents the feature in column $n$ of the $m$-th data flow. Because the subsequent feature extraction modules need a unified input format, we pad the parsed data according to the overall data situation. First, we set a maximum investigation value (MIV), which indicates the number of data features contained in each piece of data $x_m$. The data are zero-padded if the number of data features $n$ is less than MIV; otherwise, the redundant feature data are truncated. By default, we set MIV to 784 dimensions. Each piece of data is calculated as follows:
$$x_m = \begin{cases} (x_m^1, \ldots, x_m^n, 0, \ldots, 0), & n < \mathrm{MIV}, \\ (x_m^1, \ldots, x_m^{\mathrm{MIV}}), & n \ge \mathrm{MIV}. \end{cases}$$
(iii) Data Normalization. Different evaluation indices often have different dimensions and dimensional units. Data normalization eliminates the dimensional influence between indices and improves the comparability between data indices. After normalization of the original data, each index is on the same order of magnitude, which is suitable for comprehensive comparative evaluation. The calculation formula is as follows:
$$\tilde{x}_m^n = \frac{x_m^n - \min(x_m^j)}{\max(x_m^i) - \min(x_m^j)},$$
where $\max(x_m^i)$ and $\min(x_m^j)$ are the largest and smallest feature values in the $m$-th piece of data, respectively.
Computational Intelligence and Neuroscience
(iv) Data Block.
First, dataset $D$ is randomly divided into a training set $D_{train}$ and a test set $D_{test}$ at a ratio of 7 : 3. Then, we divide the obtained training and test sets into data blocks. To extract lightweight features, a piece of data $x_m$ is divided into several equal parts. Here, we divide a piece of data into 28 equal parts, with 28 bytes per block (784 / 28 = 28).
To better illustrate the pretreatment module of DETD, we summarize it as Algorithm 1. In particular, let $RT$ denote the original encrypted traffic data, let $G = (g_1, g_2, \ldots, g_m)$ be the preprocessed dataset, and let $PT_i^n$ denote the $i$-th data item in $RT^n$. After data preprocessing, a pure encrypted traffic dataset is generated for the next step of DETD.
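The padding, normalization, and blocking steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names are hypothetical, packet reading and flow purification are omitted, and mapping a constant flow to all zeros is an assumption made here to avoid division by zero, not something the text specifies.

```python
MIV = 784         # maximum investigation value (default from the text)
NUM_BLOCKS = 28   # 784 features split into 28 blocks of 28 bytes each


def pad_or_truncate(flow_bytes, miv=MIV):
    """Truncate a flow to miv bytes, or zero-pad it up to miv bytes."""
    values = list(flow_bytes[:miv])
    return values + [0] * (miv - len(values))


def min_max_normalize(values):
    """Scale values to [0, 1] using the min and max of this piece of data."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant flow maps to all zeros.
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def split_into_blocks(values, num_blocks=NUM_BLOCKS):
    """Split one piece of data into num_blocks equally sized blocks."""
    size = len(values) // num_blocks
    return [values[i * size:(i + 1) * size] for i in range(num_blocks)]


def preprocess(flow_bytes):
    """Pad or truncate, normalize, then block a single flow."""
    return split_into_blocks(min_max_normalize(pad_or_truncate(flow_bytes)))


blocks = preprocess(b"\x10\x20")  # a very short flow: two bytes, then zero padding
```

Each returned block would then be handed to its own small-scale autoencoder in the next module.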

Parallel SAE Feature Extraction Module.
The core of the entire DETD framework is the parallel SAE automatic feature extraction module, called PSAE. It extracts features in parallel from the preprocessed encrypted traffic packets using small-scale stacked autoencoders (SAEs) [33]. The parallel automatic feature extraction module in the DETD framework is shown in Figure 3. The module comprises two steps: a parallel SAE training process and a hyperparameter tuning process.
(i) PSAE Training. Let $X$ be the preprocessed encrypted traffic data packet and $X'$ be the reconstructed data. After segmenting $X$, we obtain $m$ encrypted traffic data segments $g_i$, where $G = \{g_1, g_2, \ldots, g_m\}$. After each small-scale stacked autoencoder is trained, we obtain the features of the corresponding segment from its hidden layers. Let $Q_{SAE}(X)$ represent the parallel SAE training process, $h(z) = 1/(1 + e^{-z})$ be the sigmoid function, and $J(X)$ denote the objective function, which measures the reconstruction error between $X$ and $X'$.
(ii) Hyperparameter Tuning. Because the parameters for feature extraction (such as the learning rate, number of iterations, and batch size) differ between traffic data blocks, we tune the hyperparameters of the parallel SAE feature extraction module to avoid overfitting. At the same time, to generate an optimal model, the predefined loss function is minimized for each given piece of data. We use the root mean square propagation algorithm (RMS-Prop) to train the model. RMS-Prop is an optimizer with pseudocurvature information that normalizes the gradient using the magnitude of recent gradients. It is also a robust optimizer that handles stochastic objectives well, making it suitable for mini-batch learning.
Finally, we output the features extracted by the hidden layers (see Algorithm 2 for more details). After this process, we obtain a preliminary set of features for anomaly detection classification.
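To make the RMS-Prop update concrete, the following sketch applies it to a toy one-dimensional quadratic rather than to an autoencoder. The function name, the toy objective, and the hyperparameter values (`lr`, `rho`, `eps`) are illustrative assumptions and are not taken from the paper.

```python
import math


def rmsprop_minimize(grad_fn, w0, lr=0.01, rho=0.9, eps=1e-8, steps=1000):
    """Minimize a scalar function with RMS-Prop, given its gradient."""
    w = w0
    avg_sq = 0.0  # running average of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        # Track the recent gradient magnitude ...
        avg_sq = rho * avg_sq + (1.0 - rho) * g * g
        # ... and use it to normalize the step.
        w -= lr * g / (math.sqrt(avg_sq) + eps)
    return w


# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3).
w_star = rmsprop_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Because the step is divided by the recent gradient magnitude, its size stays close to `lr` regardless of the gradient scale, which is one reason the optimizer behaves robustly on noisy mini-batch gradients.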

Feature Selection Module.
The parallel SAE feature extraction module extracts features in an unsupervised manner. Not all of the extracted features are helpful for the classification task, and redundant features may even lead to poor detection. In the DETD framework, an L1 regularization-based feature selection method is adopted to select the top $P$ features that contribute to classification as the input for anomaly detection [34]. The L1 regularization-based feature selection method is based on an SVM with a linear kernel. Given a dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $\hat{y}$ is the classification result of the linear SVM, $y$ is the data label, and $\alpha$ denotes the weight control coefficient, the L1-based loss function consists of the classification loss between $\hat{y}$ and $y$ plus an L1 penalty on the weights:
$$L(\omega) = \sum_{i=1}^{n} \ell(\hat{y}_i, y_i) + \alpha \sum_{j} |\omega_j|.$$
By scaling the value of $\alpha$, the strength of the L1 weight decay can be controlled. Thus, the weights $\omega_j$ of features with a low contribution rate are driven to 0, and the top $P$ features that contribute substantially to anomaly detection classification are retained for the final anomaly detection module.
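The effect of the L1 penalty on feature selection can be illustrated with the soft-thresholding operator, which is the proximal step associated with an L1 penalty. The sketch below does not train a linear SVM; it only assumes that a weight vector is already available and shows how a larger $\alpha$ pushes small weights to exactly zero before the top $P$ features are kept. All names and values are hypothetical.

```python
def soft_threshold(w, alpha):
    """Shrink a weight toward zero by alpha; weights within [-alpha, alpha] become 0."""
    if w > alpha:
        return w - alpha
    if w < -alpha:
        return w + alpha
    return 0.0


def select_top_features(weights, alpha, p):
    """Return indices of the (at most) p features with the largest surviving weights."""
    shrunk = [soft_threshold(w, alpha) for w in weights]
    survivors = [i for i, w in enumerate(shrunk) if w != 0.0]
    survivors.sort(key=lambda i: abs(shrunk[i]), reverse=True)
    return survivors[:p]


# Hypothetical weights for five extracted features.
weights = [0.9, -0.05, 0.4, 0.02, -0.7]
selected = select_top_features(weights, alpha=0.1, p=2)
```

In practice the shrinkage happens inside the training of the L1-penalized classifier, so the selected set depends on the data as well as on $\alpha$.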

Anomaly Detection Module.
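The anomaly detection module is mainly used to train a classifier with superior detection performance on the feature set for the final anomaly detection. An ensemble learning approach, the AdaBoost classifier [35], is used in the DETD framework. Let $D_{train} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ be the training set, where $y_i \in \{-1, +1\}$ is the label of $x_i$. The number of iterations is set to $M$. We can divide the anomaly detection module into the following steps:
(i) Initialize the weight distribution of the training set: $D_1 = (\omega_{1,1}, \omega_{1,2}, \ldots, \omega_{1,n})$, $\omega_{1,i} = 1/n$, $i = 1, 2, \ldots, n$.
(ii) For $m = 1, 2, \ldots, M$:
(a) Train a weak classifier $G_m(x)$ on the training set with weight distribution $D_m$.
(b) Calculate the classification error rate of $G_m(x)$ on the training set: $e_m = \sum_{i=1}^{n} \omega_{m,i} I(G_m(x_i) \ne y_i)$.
(c) Calculate the weight of $G_m(x)$ in the strong classifier: $\alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}$.
(d) Update the weight distribution of the training set: $\omega_{m+1,i} = \frac{\omega_{m,i}}{Z_m} \exp(-\alpha_m y_i G_m(x_i))$, where $Z_m = \sum_{i=1}^{n} \omega_{m,i} \exp(-\alpha_m y_i G_m(x_i))$ is the normalization factor.
(iii) The resulting classifier is $G(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right)$.

(4) Purify $PT_i^n$ following "Flow Purification"
(5) end for
(6) for each i do
(7) Cut the length of $PT_i^n$ to MIV bytes
(8) Traffic data normalization
(9) Divide the flow into m pieces
(10) end for
(11) end for
ALGORITHM 1: Preprocessing algorithm.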
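The boosting procedure of the anomaly detection module can be sketched in plain Python as follows. Decision stumps as weak classifiers, the toy one-dimensional dataset, and all function names are assumptions made for illustration; the paper does not state which weak learner is used, and the experiments rely on library implementations.

```python
import math


def train_stump(X, y, w):
    """Find the (error, feature, threshold, sign) stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for s in (1, -1):
                pred = [s if row[j] <= t else -s for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, s)
    return best


def stump_predict(stump, row):
    """Predict +1 or -1 for one sample with a trained stump."""
    _, j, t, s = stump
    return s if row[j] <= t else -s


def adaboost_fit(X, y, M):
    """Train M weighted stumps; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n                    # uniform initial weights
    model = []
    for _ in range(M):
        stump = train_stump(X, y, w)     # weak classifier on current weights
        err = min(max(stump[0], 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # weight of this classifier
        model.append((alpha, stump))
        # Increase weights of misclassified samples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return model


def adaboost_predict(model, row):
    """Sign of the weighted vote of all weak classifiers."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1


# Toy data: no single threshold separates the two classes.
X = [[v] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost_fit(X, y, M=3)
```

On this toy set, no single stump is error-free, while the weighted vote of three stumps classifies every training point correctly.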

Experimental Evaluation
This section describes the datasets, evaluation metrics, and experimental results of DETD.

Experimental Dataset and Environment.
As noted by Dainotti et al. [36], the lack of multiple shareable traffic datasets as test data is the most obvious obstacle to progress in traffic classification. In our experiment, we used the CTU-13 malicious traffic dataset (provided by the Czech Technical University [37]) to test the performance of DETD. The dataset we selected was 3.71 GB in size, in PCAP format. After data preprocessing, 1.16 million pieces of data were obtained, of which 800,000 were used as the training set and 360,000 were used as the test set. The experimental environment is configured as follows: Windows 10, an Intel i7-7700HQ CPU, 16 GB of memory, and a 1060 GPU. The software frameworks for machine learning are TensorFlow and scikit-learn.

Evaluation Metrics.
Three common metrics are used to measure performance: accuracy, recall, and F1_score. Accuracy is the proportion of correct predictions among all predictions. Recall is the proportion of actual positive cases that are correctly predicted by the classifier. F1_score combines precision and recall. Mathematically,
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{F1\_score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
where TP (true positive) is the number of cases correctly classified as a specific class; FP (false positive) is the number of cases incorrectly classified as the positive class; FN (false negative) is the number of positive cases incorrectly classified as negative; TN (true negative) is the number of cases correctly classified as not belonging to that specific class; and precision shows how many of the positive predictions are correct.
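As a small worked example of these metrics, the sketch below computes them from the four confusion-matrix counts. The function name and the counts are hypothetical and are not results from the paper; returning 0.0 when a denominator is zero is a convention chosen here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for 1000 test flows.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=870)
```

With these counts, accuracy is 960/1000, precision is 90/100, and recall is 90/120, which shows how a high accuracy can coexist with a noticeably lower recall on an imbalanced test set.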

Comparison with State-of-the-Art Methods.
We use our proposed method and eight state-of-the-art automatic feature extraction algorithms [10-12, 17, 18, 30-32] to detect anomalies in encrypted traffic data. Table 1 compares the proposed algorithm with these anomaly detection algorithms. For a better comparison, we list only the best-performing accuracy for each method. According to Table 1, the experimental results of the DETD framework are clearly better than those of the manual feature extraction method, which has many limitations. First, with limited samples and computing units, shallow models have a limited ability to represent complex functions, and their generalization ability is restricted on complex classification problems. More importantly, shallow models require features to be extracted manually from samples. However, manually extracting features is a very laborious task, and the quality of the features is largely determined by experience and luck. Using deep learning to extract features is advantageous because it can keep the number of hidden layer nodes at a polynomial multiple of the number of input nodes, instead of an exponential multiple, while retaining strong expressive power. Compared with these deep learning frameworks (such as [30][31][32]), DETD has great advantages in feature extraction. We observe that, for these deep learning methods, local features are better than global features for encrypted traffic anomaly detection. Moreover, the proposed DETD anomaly detection algorithm improves the AUC index by almost 2.5 percentage points. At the same time, the DETD framework is superior to other deep learning feature extraction frameworks in terms of time and computational cost, and the detection performance is also improved. In particular, the L1-based feature selection used in DETD improves the interpretability of deep learning algorithms for encrypted traffic anomaly detection (see Discussion for more details).

Why Did We Choose Parallel Automatic Feature Extraction?
We focused on the impact of the block-based parallel automatic extraction algorithm and the unblocked serial automatic extraction algorithm on anomaly detection. We experimented from the following two aspects. (1) The effect of different feature extraction methods on anomaly detection: because we did not know whether the extracted shallow features or deeper features have a greater impact on the experimental results, we extracted the features to be compared from the three hidden layers of the stacked autoencoder for the final anomaly detection.
(2) Feature extraction efficiency comparison: we compared the time consumed by feature extraction on the training set, the test set, and the entire dataset. To ensure the fairness of the experiment, we did not apply the feature selection algorithm, which avoids differences in detection performance caused by different feature dimensions after feature selection. Figure 4 shows the experimental evaluation metrics of nine machine learning classifiers for the unblocked serial automatic feature extraction algorithm and the parallel automatic feature extraction method. The nine machine learning classifiers are logistic regression, decision tree, random forest, Naive Bayes, AdaBoost, SVM (linear), SVM (RBF), gradient boosting, and XGBoost.
From Figure 4, we can see that the features output by the final hidden layer can substantially improve the classification accuracy of each machine learning classifier, and all anomaly detection classifiers other than the Naive Bayes classifier showed very good performance. The three metrics of accuracy, recall, and F1_score clearly show that the traffic-blocked parallel automatic feature extraction algorithm is better than the unblocked serial feature extraction algorithm and can effectively solve the problem of encrypted traffic anomaly detection.
For feature extraction efficiency, we comprehensively compared the feature extraction time of the two feature extraction algorithms on the training set, the test set, and the whole dataset. Since the time consumed by the blocked automatic feature extraction algorithm varies between blocks, for a better comparison, we use the average time consumed by the blocked parallel SAE feature extraction algorithm. Table 2 compares the efficiency of the two feature extraction algorithms in extracting encrypted traffic features.
The experimental results show that the parallel SAE automatic feature extraction algorithm at the core of the DETD framework is superior to the unblocked serial automatic feature extraction algorithm in anomaly detection, and that it consumed only about 1/3 of the feature extraction time of the unblocked algorithm, which can greatly reduce the anomaly detection delay caused by feature extraction.
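As a rough sanity check on these timing figures, and taking the reported one-third ratio at face value, the relative time saving can be worked out as follows. The symbols $T_{\mathrm{serial}}$ and $T_{\mathrm{parallel}}$ are introduced here only for illustration and denote the feature extraction times of the unblocked serial and blocked parallel algorithms.

$$\frac{T_{\mathrm{parallel}}}{T_{\mathrm{serial}}} \approx \frac{1}{3} \quad\Longrightarrow\quad 1 - \frac{T_{\mathrm{parallel}}}{T_{\mathrm{serial}}} \approx 0.667, \qquad \frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}} \approx 3.$$

That is, the extraction time falls by roughly two-thirds, which appears consistent with the efficiency gain of about 66% reported for DETD over the conventional stacked autoencoder.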

Why Did We Choose the L1 Regularization Feature Selection Algorithm?
Feature selection can effectively reduce computational cost and largely avoid the degradation of classification accuracy caused by abnormal factors such as noise. In this section, we used a stacked autoencoder to extract encrypted traffic features. The above experimental results show that the output of the third hidden layer gives the best features. To comprehensively compare the L1 regularization-based feature selection algorithm with other selection algorithms, we chose five commonly used feature selection algorithms: the variance threshold method (VT), the chi-square test, cross-validated recursive feature elimination, decision tree, and random forest. The experiment was performed in the following two aspects: (1) the feature set dimension after feature selection; and (2) the performance of the selected feature set on a weak classifier. Based on the above experiment, we selected the Naive Bayes classifier as the weak classifier. Figure 5 shows the size of the feature set filtered by the six feature selection algorithms. Table 3 shows the performance of the feature set filtered by the six feature selection algorithms on Naive Bayes.
As seen from Figure 5 and Table 3, the feature selection algorithms based on L1 regularization, decision tree, and random forest are more advantageous in terms of computational cost and time consumption. Their feature set dimensions do not exceed 50, the computational complexity is not high, and the results on the weak classifier Naive Bayes are satisfactory.
Moreover, the feature sets obtained by random forest and L1 regularization screening achieve 99.559% and 99.662%, respectively, with the Naive Bayes classifier. Therefore, in the subsequent experiments, we compared random forest and L1 regularization as feature selection algorithms.

AdaBoost Classifiers vs. Other Machine Learning Classifiers.
In this part, we verified the anomaly detection performance of AdaBoost and other anomaly detection classifiers after random forest and L1 regularization feature selection. To comprehensively evaluate the advantages of the AdaBoost classifier, we selected a total of nine commonly used machine learning classifiers: logistic regression, decision tree, random forest, Naive Bayes, AdaBoost, SVM (linear), SVM (RBF), gradient boosting, and XGBoost. Figure 6 shows the three evaluation metrics of the nine classifiers for the two feature selection methods (random forest and L1 regularization). These three evaluation metrics show that the feature sets screened by the two feature selection algorithms perform excellently on various anomaly detection classifiers, especially the AdaBoost classifier. The accuracy of the feature set based on the L1 feature selection algorithm on the AdaBoost classifier is as high as 99.998%, which is higher than that of the random forest feature selection algorithm.
Figure 6: The left column shows the three experimental evaluation metrics (accuracy, F1_score, and recall) of the machine learning classifiers with the random forest selection algorithm, and the right column shows the same three metrics with the L1 regularization-based algorithm.

Conclusions
Identifying malicious traffic without decryption is currently a major challenge in anomaly detection. However, existing methods always require tedious analysis of various traffic features and attack features to extract features. To address this deficiency, we propose the DETD anomaly detection framework, which applies deep automatic feature extraction to encrypted traffic anomaly detection. The experimental results show that the proposed DETD framework has a large advantage in extracting encrypted traffic features. The anomaly detection accuracy of DETD is as high as 99.998%, which outperforms other encrypted traffic detection algorithms, such as the autoencoder-based method and the CNN-based approach. In other words, these results show that DETD is better than the recently proposed deep learning anomaly detection frameworks, making it suitable for encrypted traffic intrusion detection. In future work, we will further investigate the classification of weakly labeled and unlabeled samples based on anomaly detection. Moreover, designing corresponding solutions for different types of encrypted traffic anomalies is another research direction.

Data Availability
The datasets used in this paper are openly available and can be downloaded from the Internet.

Conflicts of Interest
The authors declare that they have no conflicts of interest.